Abstract

Many diseases result from the interactions between genes and the environment. An efficient method has been proposed for
a case-control study to estimate the genetic and environmental main effects and their interactions, which exploits the 
assumptions of gene-environment independence and Hardy-Weinberg equilibrium. To estimate the absolute and relative 
risks, one needs to resort to an alternative design: the case-base study. In this paper, the authors show how to analyze a 
case-base study under the above dual assumptions. This approach is based on a conditional logistic regression of case- 
counterfactual controls matched data. It can be easily fitted with readily available statistical packages. When the dual 
assumptions are met, the method is approximately unbiased and has adequate coverage probabilities for confidence 
intervals. It also results in smaller variances and shorter confidence intervals as compared with a previous method for a case- 
base study which imposes neither assumption. 
Introduction

Many diseases result from the interactions between genes and 
the environment [1,2]. A case-control study will provide estimates 
of the genetic and environmental main effects (in terms of odds 
ratios), and their interactions (in terms of ratios of odds ratios). It is 
often reasonable to assume that among the non-diseased subjects 
in the study population, the gene under study is in Hardy- 
Weinberg equilibrium (which will be achieved within one 
generation of random mating in any population [3]) and is 
independent of exposures (genes being constitutional and envi- 
ronmental exposures being exogenous, are often uncorrelated to 
each other). Previous studies have demonstrated that imposing the 
dual assumptions of gene-environment independence and Hardy- 
Weinberg equilibrium can greatly improve the statistical efficiency 
of a case-control study [4-8].

However, in addition to odds ratios, we may also be interested 
in knowing the relative and absolute risks of subjects with different 
genetic and environmental profiles in the population. The relative 
risk is the ratio of the disease risk for individuals with one specific 
genetic and environmental profile, to the disease risk for those at a 
reference level. While indices of relative risk and odds ratio are 
equally suitable for etiologic inferences, a relative risk (a ratio of 
two risks) is easier to follow than an odds ratio (a ratio of two 
'odds'; but what is an odds?). The risk itself (or the absolute risk, to 
be precise) is also important; it is the disease probability for an
individual with a specific genetic and environmental profile, and
should be a clinically valuable index. Unfortunately, a case-control 
design does not provide estimates for the absolute risks; without a 
rare-disease assumption, estimates for the relative risks are also not 
provided. 

A case-base design is an attractive alternative to the case-control 
design [9-13]. In contrast to the case-control study which samples 
the non-diseased subjects in the study base as the control group, 
the case-base study samples the entire study base without regard to 
disease status. The design directly produces a relative risk estimate
without resorting to the rare-disease assumption [9-12]. Recently, 
Chui and Lee [13] described a logistic model for case-base study 
which can be easily fitted using existing statistical software to 
produce odds ratio estimates and, upon one additional step of 
simple calculations of the model parameters, relative and absolute 
risk estimates as well. However, Chui and Lee [13] did not 
elaborate on how to incorporate the assumptions of gene- 
environment independence and Hardy-Weinberg equilibrium into 
the model to further improve statistical efficiency. 

In this paper, we show how to analyze a case-base study under
the above dual assumptions. We perform a Monte-Carlo
simulation to investigate the statistical performance of the 
proposed method. 
Methods

Case-Base Study Assuming Gene-Environment 
Independence and Hardy-Weinberg Equilibrium 

Let $G$ = 0, 1 and 2 represent the number of variant alleles a
subject carries. We define two dummy variables, $G_1$ and $G_2$:
$G_1 = 1$ if $G = 1$, and 0 otherwise; $G_2 = 1$ if $G = 2$, and 0
otherwise. Let $E$ be the exposure status of a subject, which can be
on any measurement scale: binary, ordinal or continuous. Let $D$
represent the disease status of a subject, with $D = 1$ for diseased
and $D = 0$ for non-diseased. We assume that the disease risk in the
study population follows a logistic model:



$$\log\left[\frac{\Pr(D=1 \mid G,E)}{\Pr(D=0 \mid G,E)}\right] = \mu + \alpha_1 G_1 + \alpha_2 G_2 + \beta E + \gamma_1 G_1 E + \gamma_2 G_2 E. \tag{1}$$



The $\exp(\mu)$ is the baseline disease odds in the population;
$\exp(\alpha_1)$ is the odds ratio of disease for those with $G = 1$ vs.
those with $G = 0$; and $\exp(\alpha_2)$, the odds ratio for those with
$G = 2$ vs. those with $G = 0$. The $\exp(\beta)$ is the odds ratio
associated with the environmental exposure. The $\exp(\gamma_1)$ and
$\exp(\gamma_2)$ are the odds ratios associated with gene-environment
interactions.

In a case-base study, researchers implement two sampling 
schemes: the 'case' and the 'control' sampling schemes [9-13]. The 
case sampling scheme targets the diseased subjects. Let $S_1 = 1$
indicate that a diseased subject is recruited in the case sample of a
case-base study; $S_1 = 0$, otherwise. In such a case sampling
scheme:



$$\Pr(S_1 = 1 \mid D,G,E) = \Pr(S_1 = 1 \mid D) = \phi_1 D, \tag{2}$$



where $\phi_1$ is a constant between 0 and 1. Note that the sampling
probability depends only on $D$, not on $G$ and $E$. In this sampling
scheme, the diseased subjects have a constant non-zero probability
of being recruited
[$\Pr(S_1 = 1 \mid D = 1, G, E) = \Pr(S_1 = 1 \mid D = 1) = \phi_1$],
whereas the non-diseased subjects have a probability of zero of being
recruited
[$\Pr(S_1 = 1 \mid D = 0, G, E) = \Pr(S_1 = 1 \mid D = 0) = 0$].

The control sampling scheme targets the entire population (the 
study base) without regard to disease status. Let $S_0 = 1$ indicate
that a subject is recruited in the control sample; $S_0 = 0$, otherwise.
Such a control sampling scheme is denoted as:



$$\Pr(S_0 = 1 \mid D,G,E) = \phi_0, \tag{3}$$



where $\phi_0$ is a constant between 0 and 1. Note that this sampling
scheme is essentially a random sampling of the study population at
large; it depends on neither $D$, $G$ nor $E$.

The two sampling schemes are assumed to be independent of 
each other. A diseased subject can be recruited in a case-base 
study simultaneously in the case sample and in the control sample. 
The probability of a diseased subject entering the case-base study 
as a duplicate sample ($S_0 = S_1 = 1$) is:



$$\Pr(S_0 = S_1 = 1 \mid D=1,G,E) = \Pr(S_0 = 1 \mid D=1) \times \Pr(S_1 = 1 \mid D=1) = \phi_0 \phi_1, \tag{4}$$



which depends on neither $G$ nor $E$. The event $S_0 + S_1 \geq 1$
indicates that a subject is recruited in a case-base study through
case sampling, control sampling or both. Let $\pi$ be the probability
that a diseased subject recruited in a case-base study can be found
in the control sample, i.e.:



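$$\begin{aligned}
\pi &= \Pr(S_0 = 1 \mid D=1,G,E,S_0+S_1 \geq 1) \\
&= \frac{\Pr(S_0 = 1, S_0+S_1 \geq 1 \mid D=1,G,E)}{\Pr(S_0+S_1 \geq 1 \mid D=1,G,E)} \\
&= \frac{\Pr(S_0 = 1 \mid D=1,G,E)}{\Pr(S_0 = 1 \mid D=1,G,E) + \Pr(S_1 = 1 \mid D=1,G,E) - \Pr(S_0 = S_1 = 1 \mid D=1,G,E)} \\
&= \frac{\phi_0}{\phi_0 + \phi_1 - \phi_0 \phi_1}.
\end{aligned} \tag{5}$$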



Again, this depends on neither G nor E. 
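
As a rough numerical illustration of Equations (2) to (5), the following Python sketch evaluates the closed form $\pi = \phi_0/(\phi_0+\phi_1-\phi_0\phi_1)$ and compares it with a small simulation of the two independent sampling schemes applied to diseased subjects. The function names `pi_closed_form` and `simulate_pi` are hypothetical and not part of any package; the sampling probabilities are the ones used in the Monte-Carlo section of this paper.

```python
import random

def pi_closed_form(phi0, phi1):
    """Probability that a recruited diseased subject is in the control sample."""
    return phi0 / (phi0 + phi1 - phi0 * phi1)

def simulate_pi(phi0, phi1, n_diseased, seed=0):
    """Apply both sampling schemes independently to n_diseased diseased subjects."""
    rng = random.Random(seed)
    recruited = 0    # diseased subjects with S0 + S1 >= 1
    in_control = 0   # recruited diseased subjects with S0 = 1
    for _ in range(n_diseased):
        s1 = rng.random() < phi1   # case sampling, Equation (2) with D = 1
        s0 = rng.random() < phi0   # control sampling, Equation (3)
        if s0 or s1:
            recruited += 1
            in_control += int(s0)
    return in_control / recruited

if __name__ == "__main__":
    print(pi_closed_form(0.005, 0.05))
    print(simulate_pi(0.005, 0.05, 200_000))
```

The simulated proportion should approach the closed form as the number of diseased subjects grows; neither quantity involves $G$ or $E$.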
The maximum likelihood estimate of $\pi$ and its variance (see
Chui and Lee [13]) are:

$$\hat{\pi} = \frac{n_{DC}}{n_D} \tag{6}$$

and

$$\mathrm{Var}(\hat{\pi}) = \frac{\hat{\pi}(1-\hat{\pi})}{n_D}, \tag{7}$$

where $n_D$ is the total number of distinct diseased subjects recruited
in the case-base study, and $n_{DC}$ is the number of diseased subjects
recruited in the control sample. Chui and Lee [13] showed that
the disease risk in a case-base sample also follows a logistic model: 



$$\log\left[\frac{\Pr(D=1 \mid G,E,S_0+S_1 \geq 1)}{\Pr(D=0 \mid G,E,S_0+S_1 \geq 1)}\right] = \mu^* + \alpha_1 G_1 + \alpha_2 G_2 + \beta E + \gamma_1 G_1 E + \gamma_2 G_2 E, \tag{8}$$

where $\mu^* = \mu - \log \pi$.
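
The offset $\log \pi$ can be traced back to the sampling probabilities. The following is a brief sketch of that step, using only Equations (2) to (5); it is not reproduced from Chui and Lee [13], whose derivation may differ in form. A diseased subject enters the study with probability $\phi_0 + \phi_1 - \phi_0\phi_1 = \phi_0/\pi$, and a non-diseased subject with probability $\phi_0$, so that by Bayes' theorem:

$$\begin{aligned}
\frac{\Pr(D=1 \mid G,E,S_0+S_1 \geq 1)}{\Pr(D=0 \mid G,E,S_0+S_1 \geq 1)}
&= \frac{\Pr(D=1 \mid G,E)}{\Pr(D=0 \mid G,E)} \times \frac{\Pr(S_0+S_1 \geq 1 \mid D=1,G,E)}{\Pr(S_0+S_1 \geq 1 \mid D=0,G,E)} \\
&= \frac{\Pr(D=1 \mid G,E)}{\Pr(D=0 \mid G,E)} \times \frac{\phi_0/\pi}{\phi_0}
= \frac{1}{\pi} \times \frac{\Pr(D=1 \mid G,E)}{\Pr(D=0 \mid G,E)}.
\end{aligned}$$

Taking logarithms shifts only the intercept, by $-\log \pi$, and leaves the other coefficients of Model (1) unchanged.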

As in Lee et al. [8], we assume that among the non-diseased 
subjects in the study population, the gene (G) is independent of 
environmental exposure ($E$) [the first equality in the following
Equation (9)] and in Hardy-Weinberg equilibrium (the second
equality), i.e.: 



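$$\begin{aligned}
\Pr(G \mid E, D=0) &= \Pr(G \mid D=0) \\
&= (1-p)^2 \times 2^{G_1} \times \left(\frac{p}{1-p}\right)^{G} \\
&= \Pr(G=0 \mid D=0) \times \exp(G_1 \log 2 + \delta G),
\end{aligned} \tag{9}$$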



where $p$ is the allele frequency and $\delta = \log\left[\frac{p}{1-p}\right]$,
the log allele frequency odds, among the non-diseased subjects in the
study population. Combining Equations (8) and (9), the likelihood
function for a case-base study under the assumptions of
gene-environment independence and Hardy-Weinberg equilibrium is
found to be (Exhibit S1):

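$$\Pr(D,G \mid E,S_0+S_1 \geq 1) = \frac{\exp(G_1 \log 2 + \delta G + \mu^* D + \alpha_1 G_1 D + \alpha_2 G_2 D + \beta E D + \gamma_1 G_1 E D + \gamma_2 G_2 E D)}{\sum_{d=0}^{1}\sum_{g=0}^{2} \exp(g_1 \log 2 + \delta g + \mu^* d + \alpha_1 g_1 d + \alpha_2 g_2 d + \beta E d + \gamma_1 g_1 E d + \gamma_2 g_2 E d)}. \tag{10}$$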



Model (10) above has exactly the same form as the likelihood
function of a 1:5 matched case-control study. [The denominator in
Model (10) has a total of six terms, corresponding to one 'case' and
five 'controls'.] Therefore, one can adopt the 'counterfactual
approach' [8] to fit Model (10). To be precise, one first creates a
recruitment indicator, $R = 1$, for each and every distinct subject
actually recruited in the case-base study. Next, one creates a total
of five counterfactual subjects (all of them with $R = 0$) for each
recruited subject; the exposure status ($E$) of the five counterfactual
subjects is deliberately set to be exactly the same as that of the
recruited subject to whom they are matched, but the disease status
($D$) and gene ($G$) are different. (The five subjects represent the
five different ways of making ($D$, $G$)-different counterfactuals.)
Treating $R$ as the outcome variable, one then performs a conditional
logistic regression analysis (using existing statistical software, such
as SAS) with the following regression equation:
$G_1 \log 2 + \delta G + \mu^* D + \alpha_1 G_1 D + \alpha_2 G_2 D + \beta E D + \gamma_1 G_1 E D + \gamma_2 G_2 E D$
(the term $G_1 \log 2$ carries no parameter and enters as an offset),
based on the above created 1 (factual): 5 (counterfactuals) matched data.
The results are the conditional maximum likelihood estimates of
the total 7 parameters in Model (10), together with their
variance-covariance matrix, $\Sigma$, a 7 × 7 matrix.
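
The data-expansion step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' SAS code: the helper names `design_row`, `matched_set` and `factual_probability` are hypothetical, the parameter order `[delta, mu*, alpha1, alpha2, beta, gamma1, gamma2]` is chosen here for convenience, and the sketch only builds one 1:5 matched set and evaluates its likelihood contribution. The actual fitting would be done with a conditional logistic regression routine.

```python
import math

def design_row(d, g, e):
    """Offset and covariates of the regression equation for one (D, G, E) profile."""
    g1, g2 = int(g == 1), int(g == 2)
    offset = g1 * math.log(2)  # G1 * log 2 has no coefficient to estimate
    # Covariates in the assumed order: G, D, G1*D, G2*D, E*D, G1*E*D, G2*E*D
    x = [g, d, g1 * d, g2 * d, e * d, g1 * e * d, g2 * e * d]
    return offset, x

def matched_set(d_obs, g_obs, e_obs):
    """One factual row (R = 1) plus five counterfactual rows (R = 0), all sharing E."""
    rows = []
    for d in (0, 1):
        for g in (0, 1, 2):
            offset, x = design_row(d, g, e_obs)
            r = int(d == d_obs and g == g_obs)
            rows.append({"R": r, "D": d, "G": g, "E": e_obs,
                         "offset": offset, "x": x})
    return rows

def factual_probability(rows, theta):
    """Conditional-logit contribution of the factual row, i.e. Model (10)."""
    scores = [math.exp(row["offset"] + sum(t * v for t, v in zip(theta, row["x"])))
              for row in rows]
    factual = sum(s for s, row in zip(scores, rows) if row["R"] == 1)
    return factual / sum(scores)
```

For example, `matched_set(1, 2, 1)` returns six rows for a recruited diseased subject with $G = 2$ and $E = 1$, exactly one of which has `R` equal to 1.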

Among the parameter estimates obtained from fitting Model
(10) to data, $\hat{\delta}$ is an estimate for the log allele frequency
odds in Equation (9), $\hat{\alpha}_1$ and $\hat{\alpha}_2$ are estimates
for the log genetic odds ratios in Model (1), $\hat{\beta}$ is an estimate
for the log environmental odds ratio in Model (1), and $\hat{\gamma}_1$
and $\hat{\gamma}_2$ are estimates associated with gene-environment
interactions in Model (1). The $\hat{\mu}^*$ estimate from Model
(10) is to be further combined with the $\hat{\pi}$ estimate from Equation (6)
to provide an estimate for the log baseline disease odds in Model (1):



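$$\hat{\mu} = \log(\hat{\pi}) + \hat{\mu}^*. \tag{11}$$

Because $\hat{\pi}$ is independent of $\hat{\mu}^*$ [13], the variance of $\hat{\mu}$ is:

$$\begin{aligned}
\mathrm{Var}(\hat{\mu}) &= \mathrm{Var}[\log(\hat{\pi}) + \hat{\mu}^*] \\
&= \mathrm{Var}[\log(\hat{\pi})] + \mathrm{Var}(\hat{\mu}^*) \\
&= \left(\frac{1}{\hat{\pi}}\right)^2 \mathrm{Var}(\hat{\pi}) + \mathrm{Var}(\hat{\mu}^*) \\
&= \frac{1-\hat{\pi}}{\hat{\pi} n_D} + \mathrm{Var}(\hat{\mu}^*).
\end{aligned} \tag{12}$$

Now, we can estimate the absolute and relative risks. An
estimate of the (absolute) disease risk for subjects in the study
population with ($G_1 = g_1$, $G_2 = g_2$, $E = e$) is:

$$\begin{aligned}
\widehat{\mathrm{risk}}_{\mathbf{u}} &= \frac{\exp(\hat{\mu} + \hat{\alpha}_1 g_1 + \hat{\alpha}_2 g_2 + \hat{\beta} e + \hat{\gamma}_1 g_1 e + \hat{\gamma}_2 g_2 e)}{1 + \exp(\hat{\mu} + \hat{\alpha}_1 g_1 + \hat{\alpha}_2 g_2 + \hat{\beta} e + \hat{\gamma}_1 g_1 e + \hat{\gamma}_2 g_2 e)} \\
&= \frac{\exp[\log(\hat{\pi}) + \hat{\mu}^* + \hat{\alpha}_1 g_1 + \hat{\alpha}_2 g_2 + \hat{\beta} e + \hat{\gamma}_1 g_1 e + \hat{\gamma}_2 g_2 e]}{1 + \exp[\log(\hat{\pi}) + \hat{\mu}^* + \hat{\alpha}_1 g_1 + \hat{\alpha}_2 g_2 + \hat{\beta} e + \hat{\gamma}_1 g_1 e + \hat{\gamma}_2 g_2 e]} \\
&= \frac{\exp[\log(\hat{\pi}) + \mathbf{u}\hat{\boldsymbol{\eta}}]}{1 + \exp[\log(\hat{\pi}) + \mathbf{u}\hat{\boldsymbol{\eta}}]},
\end{aligned} \tag{13}$$

where $\mathbf{u} = [1 \; g_1 \; g_2 \; e \; g_1 e \; g_2 e]$ is a 1 × 6 gene-environment
profile vector and $\hat{\boldsymbol{\eta}} = [\hat{\mu}^* \; \hat{\alpha}_1 \; \hat{\alpha}_2 \; \hat{\beta} \; \hat{\gamma}_1 \; \hat{\gamma}_2]'$ is a 6 × 1
vector of parameter estimates. Because $\hat{\pi}$ is independent of all the
other parameters [13], the variance of the estimate (in logit scale)
is:

$$\begin{aligned}
\mathrm{Var}[\mathrm{logit}(\widehat{\mathrm{risk}}_{\mathbf{u}})] &= \mathrm{Var}[\log(\hat{\pi}) + \mathbf{u}\hat{\boldsymbol{\eta}}] \\
&= \mathrm{Var}[\log(\hat{\pi})] + \mathrm{Var}(\mathbf{u}\hat{\boldsymbol{\eta}}) \\
&= \frac{1-\hat{\pi}}{\hat{\pi} n_D} + \mathbf{u}\Omega\mathbf{u}',
\end{aligned} \tag{14}$$

where $\Omega = \mathrm{Var}(\hat{\boldsymbol{\eta}})$ is readily available by simply deleting the first
row and the first column of $\Sigma$. [$\hat{\delta}$ from Model (10) plays no role in
the disease risk estimation.] Let the reference group be those
subjects in the study population with ($G_1 = g_1^*$, $G_2 = g_2^*$, $E = e^*$), and
let $\mathbf{u}^* = [1 \; g_1^* \; g_2^* \; e^* \; g_1^* e^* \; g_2^* e^*]$ be the gene-environment
profile vector for them. An estimate of the relative risk is therefore:

$$\widehat{\mathrm{RR}}_{\mathbf{u}/\mathbf{u}^*} = \frac{\widehat{\mathrm{risk}}_{\mathbf{u}}}{\widehat{\mathrm{risk}}_{\mathbf{u}^*}} = \frac{\exp[\log(\hat{\pi}) + \mathbf{u}\hat{\boldsymbol{\eta}}]}{1 + \exp[\log(\hat{\pi}) + \mathbf{u}\hat{\boldsymbol{\eta}}]} \Bigg/ \frac{\exp[\log(\hat{\pi}) + \mathbf{u}^*\hat{\boldsymbol{\eta}}]}{1 + \exp[\log(\hat{\pi}) + \mathbf{u}^*\hat{\boldsymbol{\eta}}]}. \tag{15}$$

Using the delta method, the variance of the estimate (in log
scale) is:

$$\mathrm{Var}[\log(\widehat{\mathrm{RR}}_{\mathbf{u}/\mathbf{u}^*})] = (\widehat{\mathrm{risk}}_{\mathbf{u}} - \widehat{\mathrm{risk}}_{\mathbf{u}^*})^2 \times \left[\frac{1-\hat{\pi}}{\hat{\pi} n_D} + \mathbf{w}\Omega\mathbf{w}'\right], \tag{16}$$

where

$$\mathbf{w} = \frac{(\widehat{\mathrm{risk}}_{\mathbf{u}} - 1)\mathbf{u} - (\widehat{\mathrm{risk}}_{\mathbf{u}^*} - 1)\mathbf{u}^*}{\widehat{\mathrm{risk}}_{\mathbf{u}} - \widehat{\mathrm{risk}}_{\mathbf{u}^*}}.$$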

Monte-Carlo Simulations 

For simplicity, we assume a binary exposure $E$ ($E = 0, 1$) and a
biallelic gene with genotype $G$ ($G = 0, 1, 2$). For the non-diseased
subjects in the study population, we assume gene-environment
independence and Hardy-Weinberg equilibrium, with the exposure
prevalence (for $E = 1$) and the allele frequency of the variant
allele both being set at 0.5. The disease probabilities of subjects in
the study population are assumed to follow the logistic model
[Model (1)]. We assume an autosomal recessive gene with a
genetic odds ratio of 2.0 [$\alpha_1 = 0$ and $\alpha_2 = \log(2)$], an
environmental odds ratio of 2.5 [$\beta = \log(2.5)$] and a gene-environment
interaction odds ratio of 2.0 [$\gamma_1 = 0$ and $\gamma_2 = \log(2)$]. The disease
prevalence in the study population is set at 0.1. Therefore, the six
disease risks are: $\mathrm{risk}_{G=0,E=0} = 0.0441$, $\mathrm{risk}_{G=1,E=0} = 0.0441$,
$\mathrm{risk}_{G=2,E=0} = 0.0845$, $\mathrm{risk}_{G=0,E=1} = 0.1034$,
$\mathrm{risk}_{G=1,E=1} = 0.1034$, and $\mathrm{risk}_{G=2,E=1} = 0.3157$, respectively.
The five relative risks (with $G = 0$, $E = 0$ as the reference level) are
$\mathrm{RR}_{G=1,E=0} = 1.0000$, $\mathrm{RR}_{G=2,E=0} = 1.9155$,
$\mathrm{RR}_{G=0,E=1} = 2.3449$, $\mathrm{RR}_{G=1,E=1} = 2.3449$, and
$\mathrm{RR}_{G=2,E=1} = 7.1587$, respectively. A case-base study is conducted
in the study population (population size: 100,000) with a case sampling
probability ($\phi_1$) of 0.05 and a control sampling probability ($\phi_0$)
of 0.005. Under such a sampling scheme, the case-base study is expected
to recruit a total of 500 distinct diseased and 500 distinct non-diseased
subjects. Exhibit S2 presents the SAS code for simulating data.
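
The risks and relative risks listed above follow from Model (1) once the baseline risk and the odds ratios are fixed. The short Python sketch below recomputes them as a check; it is not the simulation code of Exhibit S2. It starts from the rounded baseline risk of 0.0441 reported in the text, so the last printed digit may differ slightly from the published values, and the names `model_risk` and `risk_table` are hypothetical.

```python
import math

BASELINE_RISK = 0.0441                       # risk for G = 0, E = 0, as reported
ALPHA = {0: 0.0, 1: 0.0, 2: math.log(2.0)}   # recessive gene: only G = 2 raises the odds
BETA = math.log(2.5)                         # environmental log odds ratio
GAMMA = {0: 0.0, 1: 0.0, 2: math.log(2.0)}   # gene-environment interaction

def model_risk(g, e):
    """Disease risk under Model (1) for genotype g and exposure e."""
    mu = math.log(BASELINE_RISK / (1.0 - BASELINE_RISK))
    lp = mu + ALPHA[g] + BETA * e + GAMMA[g] * e
    return 1.0 / (1.0 + math.exp(-lp))

def risk_table():
    """Map (g, e) to (risk, relative risk versus G = 0, E = 0)."""
    ref = model_risk(0, 0)
    return {(g, e): (model_risk(g, e), model_risk(g, e) / ref)
            for e in (0, 1) for g in (0, 1, 2)}

if __name__ == "__main__":
    for (g, e), (r, rr) in risk_table().items():
        print(f"G={g} E={e} risk={r:.4f} RR={rr:.4f}")
```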

Using the formulas derived by Cheng and Lin [6], we compare
the relative efficiencies of the estimates in the conditional logistic
regression [with the assumptions of gene-environment independence
and Hardy-Weinberg equilibrium, Model (10)] relative to
the corresponding estimates in Chui and Lee's [13] unconditional
logistic regression [without the dual assumptions, Model (8)].
[The relative efficiency of method A relative to method B is defined as
the ratio of the variance of the B estimator to that of the A estimator. A
larger-than-one relative efficiency implies a better statistical
performance (larger power, better precision, etc.) for method A
as compared to method B.]

In regard to the estimates for relative risks and absolute risks, we 
perform a total of 10,000 simulations to compare the perfor- 
mances of the two approaches (the conditional logistic regression 
with the dual assumptions vs. the unconditional logistic regression 
without the dual assumptions). The means of the estimates for 
relative risks (in log scale) and absolute risks (in logit scale) are 
calculated. The variance of an estimate is calculated as the sample 
variance of the estimates. We also calculate the coverage 
probabilities and the average lengths of the 95% confidence 
intervals for the estimates. [The coverage probability of a 95% 
confidence interval for a parameter estimate is the probability that 
the interval covers the true value of the parameter in a repeated 
sampling (simulation) experiment.] 
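
The performance measures described in this section can be computed from the simulated estimates along the following lines. This is a minimal sketch with hypothetical function names; it assumes normal-based intervals of the form estimate ± 1.96 × standard error on the log or logit scale, which the text does not state explicitly.

```python
import statistics

def summarize(estimates, std_errors, true_value, z=1.96):
    """Bias, sample variance, coverage and mean interval length over simulations."""
    lowers = [est - z * se for est, se in zip(estimates, std_errors)]
    uppers = [est + z * se for est, se in zip(estimates, std_errors)]
    covered = [lo <= true_value <= hi for lo, hi in zip(lowers, uppers)]
    return {
        "bias": statistics.mean(estimates) - true_value,
        "variance": statistics.variance(estimates),   # sample variance
        "coverage": sum(covered) / len(covered),
        "mean_length": statistics.mean(hi - lo for lo, hi in zip(lowers, uppers)),
    }

def relative_efficiency(var_a, var_b):
    """Efficiency of method A relative to method B: Var(B) / Var(A)."""
    return var_b / var_a
```

Each list would hold one entry per simulated data set, so with 10,000 simulations `estimates` and `std_errors` would each have 10,000 elements.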
Results

The relative efficiency of the proposed method (with the dual 
assumptions of gene-environment independence and Hardy- 
Weinberg equilibrium), as compared to Chui and Lee's method 
(without the dual assumptions) [13], is shown in Exhibit S3. It can
be seen that exploiting the dual assumptions can greatly improve
statistical efficiency. The relative efficiencies are 1.30 ($\hat{\alpha}_1$),
1.44 ($\hat{\alpha}_2$), 1.37 ($\hat{\beta}$), 1.57 ($\hat{\gamma}_1$), and
1.85 ($\hat{\gamma}_2$), respectively.

Table 1 shows the simulation results for the estimates for 
relative and absolute risks. Using either approach, the estimates of 
relative and absolute risks are approximately unbiased. [The bias 
of a parameter estimate is the difference between the mean of the 
parameter estimates in the simulation experiment and the true 
value of that parameter; compare the column labeled 'Estimate' 
and the column labeled 'True value' in Table 1.] The 95% 
confidence intervals also achieve adequate coverage probabilities 
for both approaches. However, the variances and the average
lengths of the confidence intervals using the present method,
which imposes the dual assumptions of gene-environment
independence and Hardy-Weinberg equilibrium, are much
smaller than those from Chui and Lee's method [13], which imposes
neither assumption.
Discussion

Lee et al. [8] previously discussed how to relax the Hardy-Weinberg
equilibrium assumption for case-control studies. The same approach
can be applied to the present context of case-base studies. Under
the gene-environment independence assumption alone, the case-base
likelihood becomes:



$$\Pr(D,G \mid E,S_0+S_1 \geq 1) = \frac{\exp(\varepsilon_1 G_1 + \varepsilon_2 G_2 + \mu^* D + \alpha_1 G_1 D + \alpha_2 G_2 D + \beta E D + \gamma_1 G_1 E D + \gamma_2 G_2 E D)}{\sum_{d=0}^{1}\sum_{g=0}^{2} \exp(\varepsilon_1 g_1 + \varepsilon_2 g_2 + \mu^* d + \alpha_1 g_1 d + \alpha_2 g_2 d + \beta E d + \gamma_1 g_1 E d + \gamma_2 g_2 E d)},$$

where $\varepsilon_1 = \log[\Pr(G=1 \mid D=0)/\Pr(G=0 \mid D=0)]$ and
$\varepsilon_2 = \log[\Pr(G=2 \mid D=0)/\Pr(G=0 \mid D=0)]$ are gene-frequency-related
parameters (log genotype frequency odds among the non-diseased
subjects in the study population, to be precise). By comparison, the
likelihood function (Model 10), where both assumptions are
imposed, contains only one gene-frequency-related parameter ($\delta$).
If the study population is not homogeneous, but is instead
composed of a number of population strata, a case-base study is
also vulnerable to population stratification biases, just as a
case-control study can be. Assume that there are a total of $q$ strata;
the case-base likelihood conditioned on stratum indicators,
$t_1, \ldots, t_q$, is:

$$\Pr(D,G \mid t_1,\ldots,t_q,E,S_0+S_1 \geq 1) = \frac{\exp\left(G_1 \log 2 + \sum_{i=1}^{q} \delta_i t_i G + \sum_{i=1}^{q} \mu_i^* t_i D + \alpha_1 G_1 D + \alpha_2 G_2 D + \beta E D + \gamma_1 G_1 E D + \gamma_2 G_2 E D\right)}{\sum_{d=0}^{1}\sum_{g=0}^{2} \exp\left(g_1 \log 2 + \sum_{i=1}^{q} \delta_i t_i g + \sum_{i=1}^{q} \mu_i^* t_i d + \alpha_1 g_1 d + \alpha_2 g_2 d + \beta E d + \gamma_1 g_1 E d + \gamma_2 g_2 E d\right)}.$$

The interaction terms $t_i \times G$ ($t_i \times D$) allow the allele frequency
odds (the background disease odds) to vary between different
population strata. To use this model, one needs to know in
advance the stratum to which each and every study subject
belongs.

In this paper, we present a method to analyze the case-base 
study exploiting the assumptions of gene-environment independence
and Hardy-Weinberg equilibrium with common statistical
packages. When both assumptions are met, the simulation results 
show that the method is approximately unbiased and has adequate 
coverage probabilities of the 95% confidence intervals. It also 
results in smaller variances and shorter confidence intervals as 